GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader by pitrou · Pull Request #48925 · apache/arrow

pitrou · 2026-01-21T16:54:43Z

What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:

Fix bug reading variadic buffers with pre-buffering enabled
Fix bug reading dictionaries with pre-buffering enabled
Validate IPC buffer offsets and lengths

Testing improvements:

Exercise pre-buffering in IPC tests
Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:

Add convenience functions for integer overflow checking
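
To illustrate how overflow checking and offset/length validation fit together, here is a minimal sketch. It is not the code from this PR: `CheckedAdd` and `IsValidBufferRange` are hypothetical names, and the rule that a buffer must lie entirely within the file is an assumption about what "validate" means here.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not Arrow's API): returns a + b, or std::nullopt if
// the signed 64-bit addition would overflow.
inline std::optional<int64_t> CheckedAdd(int64_t a, int64_t b) {
  if ((b > 0 && a > std::numeric_limits<int64_t>::max() - b) ||
      (b < 0 && a < std::numeric_limits<int64_t>::min() - b)) {
    return std::nullopt;
  }
  return a + b;
}

// Hypothetical validation: a buffer described by (offset, length) must be
// non-negative and lie entirely within a file of `file_size` bytes.
inline bool IsValidBufferRange(int64_t offset, int64_t length, int64_t file_size) {
  if (offset < 0 || length < 0) return false;
  const std::optional<int64_t> end = CheckedAdd(offset, length);
  return end.has_value() && *end <= file_size;
}
```

Checking the sum before using it matters because untrusted metadata can carry an offset near `INT64_MAX`, where a plain `offset + length` would be undefined behavior for signed integers.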

Are these changes tested?

Yes, by existing and improved tests.

Are there any user-facing changes?

Bug fixes.

This PR contains a "Critical Fix". Fixes a potential crash reading variadic buffers with pre-buffering enabled.

GitHub Issue: [C++][CI] Fuzz IPC file metadata pre-buffering #48924

pitrou · 2026-01-21T20:54:27Z

@github-actions crossbow submit -g cpp

pitrou · 2026-01-22T14:41:14Z

@github-actions crossbow submit -g cpp

github-actions · 2026-01-22T14:44:01Z

Revision: 7749642

Submitted crossbow builds: ursacomputing/crossbow @ actions-cab35a473b

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-debian-experimental-cpp-gcc-15
test-fedora-42-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

pitrou · 2026-01-22T15:42:34Z

@lidavidm @WillAyd Would you like to take a look at this? The changes are non-trivial.

WillAyd · 2026-01-22T23:56:43Z

I'm not overly familiar with this part of Arrow, but generally things look good to me. Happy to offer an explicit approval if desired and no feedback from others

conbench-apache-arrow · 2026-01-26T18:53:51Z

After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit 8010794.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.


### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:

* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:

* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:

* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:

```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |       __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:

* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:

* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:

* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.
* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)

### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookies cache, which is an unordered map. This fixes the issue of duplicate cookie keys.

### Are these changes tested?

Manually on Windows, and CI

### Are there any user-facing changes?

No

* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change

WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?

Check early that the array is not all nulls before serializing it.

### Are these changes tested?

Added tests.

### Are there any user-facing changes?
No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No

* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change

`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.

### What changes are included in this PR?

Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:

- Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
- Deprecate the old Status/out-parameter overloads
- Update C++ callers and R/Python/GLib bindings to use the new API

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. Status versions of FileReader::ReadRowGroup(s) have been deprecated.

```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```

* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions do not keep a reference to their arguments. Without that reference, the arguments may be freed by GC.

### What changes are included in this PR?
* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:

```
+ python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```

This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No

* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach. Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty.
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.

### Are these changes tested?

I got :robot: to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

> Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.

### Are there any user-facing changes?

Nope

* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:

1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it.
This helper is used in all write paths:

- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:

- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change

#48932

### What changes are included in this PR?

- Fix `rsync` build error ODBC Nightly Package

### Are these changes tested?

- tested in CI

### Are there any user-facing changes?

- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
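
The stale-buffer mechanism described in this entry can be sketched in a few lines. This is a simplified model, not Arrow's `CSVWriterImpl`: the class `SketchCsvWriter`, its string sink, and its `WriteBatch()` method are hypothetical, and only the `WriteAndClearBuffer()` name is taken from the description above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a CSV writer; only the buffer handling is modeled.
class SketchCsvWriter {
 public:
  explicit SketchCsvWriter(std::string* sink) : sink_(sink) {}

  void WriteHeader(const std::string& header) {
    buffer_ = header + "\n";
    WriteAndClearBuffer();
  }

  // An empty batch leaves buffer_ untouched, mirroring the early return
  // described for TranslateMinimalBatch.
  void WriteBatch(const std::vector<std::string>& rows) {
    for (const auto& row : rows) buffer_ += row + "\n";
    WriteAndClearBuffer();
  }

 private:
  // Flush the buffer to the sink, then clear it so that stale content
  // cannot be written a second time.
  void WriteAndClearBuffer() {
    sink_->append(buffer_);
    buffer_.clear();
  }

  std::string* sink_;
  std::string buffer_;
};
```

If `buffer_.clear()` were omitted, writing an empty batch right after the header would append the header to the sink again, which is the duplicate-header symptom described above.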
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the number of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.

* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete. It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.
### Are there any user-facing changes?

No.

* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatiblity matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change

See issue #48961

Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?

Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?

Yes, locally.

### Are there any user-facing changes?

No.

Closes #48961

* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?
Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?

Nope

* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.

* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead. Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.

* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the todo https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134 which would allow better parsing of line numbers.
I could not find the relevant example to demonstrate within this project but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940 Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974 Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?
The urllib request is now sent with `"user-agent": "pyarrow"`

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.

* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery. I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:

```
AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these two issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)
* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Decimal128/256 arrays are only supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-block (with `>>>` and `...`) and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?
Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change

Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?

Refactor the Engine class to only create one target machine and pass that to the necessary functions.
Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create TargetMachine first: The code now creates the TargetMachine explicitly at the start of Engine::Make. That machine is passed to BuildJIT. In BuildJIT that machine's DataLayout is sent to LLJITBuilder which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler: SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created. A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?

Yes, unit and integration.

### Are there any user-facing changes?

No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change

Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`. - Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response. ### Are these changes tested? Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression. The only reproduction I've found involves reading a production Azure blob storage account. With this I've tested that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data around our prod reproduction into Azurite, but still can't reproduce it. ### Are there any user-facing changes? Some low-probability bugs will be gone. No interface changes. * GitHub Issue: #49043 Authored-by: Thomas Newton <thomas.w.newton@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035) ### Rationale for this change The binary_string function will attempt to allocate 0 bytes of memory, which results in a null pointer being returned, and the function interprets that as an error. ### What changes are included in this PR? Add kCanReturnErrors to the function definition to match other string functions. Move the check for 0 byte length input earlier in the binary_string function to prevent the 0-byte allocation. Add a unit test. ### Are these changes tested? Yes, unit and integration testing. ### Are there any user-facing changes? No. * GitHub Issue: #49034 Authored-by: Logan Riggs <logan.riggs@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981) ### Rationale for this change Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. The recommended replacement is `COMPILE_OPTIONS` (introduced in CMake 3.11). 
### What changes are included in this PR? Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments. ### Are these changes tested? Yes, through CI build and existing tests. ### Are there any user-facing changes? No. * GitHub Issue: #48980 Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49069: [C++] Share Trie instances across CSV value decoders (#49070) ### Rationale for this change The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead. ### What changes are included in this PR? - Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie) - Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries - Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders ### Are these changes tested? Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage. ### Are there any user-facing changes? No. * GitHub Issue: #49069 Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49076: [CI] Update vcpkg baseline to newer version (#49062) ### Rationale for this change The current version of vcpkg used is from April 2025. ### What changes are included in this PR? Update baseline to newer version. ### Are these changes tested? Yes, on CI. I've validated, for example, that xsimd 14 will be pulled. ### Are there any user-facing changes? 
No * GitHub Issue: #49076 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49074: [Ruby] Add support for writing interval arrays (#49075) ### Rationale for this change There are year month/day time/month day nano variants. ### What changes are included in this PR? * Add `ArrowFormat::IntervalType#to_flatbuffers` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #49074 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49071: [Ruby] Add support for writing list and large list arrays (#49072) ### Rationale for this change They use different offset size. ### What changes are included in this PR? * Add `ArrowFormat::ListType#to_flatbuffers` * Add `ArrowFormat::LargeListType#to_flatbuffers` * Add `ArrowFormat::VariableSizeListArray#child` * Add `ArrowFormat::VariableSizeListArray#each_buffer` * `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist * Add `garrow_list_array_get_value_offsets_buffer()` * Add `garrow_large_list_array_get_value_offsets_buffer()` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #49071 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091) ### Rationale for this change LLVM 15 or earlier uses `llvm::Optional` not `std::optional`. ### What changes are included in this PR? Use `llvm::Optional` with LLVM 15 or earlier. ### Are these changes tested? Yes, compiling. ### Are there any user-facing changes? 
No * GitHub Issue: #49087 Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101) ### Rationale for this change The Swift documentation link in the implementations.rst file was broken and returned a 404 error. ### What changes are included in this PR? Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow) ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #49100 Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com> Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49096: [Ruby] Add support for writing struct array (#49097) ### Rationale for this change It's a nested array. ### What changes are included in this PR? * Add `ArrowFormat::StructType#to_flatbuffers` * Add `ArrowFormat::StructArray#each_buffer` * Add `ArrowFormat::StructArray#children` * Fix `ArrowFormat::Array#n_nulls` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #49096 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49093: [Ruby] Add support for writing duration array (#49094) ### Rationale for this change It has unit parameter. ### What changes are included in this PR? * Add `ArrowFormat::DurationType#to_flatbuffers` * Add duration support to `#values` and `raw_records` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. 
* GitHub Issue: #49093 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099) ### Rationale for this change Documents for libarrow-cuda-glib are generated but they aren't packaged. ### What changes are included in this PR? Package documents for libarrow-cuda-glib. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #49098 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-48764: [C++] Update xsimd (#48765) ### Rationale for this change Homogenize the xsimd versions used. ### What changes are included in this PR? Move to xsimd 14 to benefit from the latest improvements relevant to the integer unpacking routines. ### Are these changes tested? Yes, with current CI. In fact, due to the absence of a pin, part of the CI already runs xsimd 14. ### Are there any user-facing changes? No. * GitHub Issue: #48764 Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> * GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047) ### Rationale for this change As discussed on the issue, we don't seem to have run asv benchmarks on Python for the last few years. It is probably broken. ### What changes are included in this PR? Remove asv benchmarking related files and docs. ### Are these changes tested? No. Validate with CI and run preview-docs to validate the docs. ### Are there any user-facing changes? No * GitHub Issue: #46008 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> * GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109) ### Rationale for this change `SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to a missing f-string prefix. ### What changes are included in this PR? 
Add the `f` prefix to the string in `SparseCOOTensor.__repr__`. ### Are these changes tested? Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```
### Are there any user-facing changes? Yes. It fixes a bug that caused incorrect or invalid data to be produced:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```
* GitHub Issue: #49108 Authored-by: Chilin <chilin.cs07@nycu.edu.tw> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> * GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126) ### Rationale for this change Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel). ### What changes are included in this PR? Removal of the dask-contrib/dask-expr package, as it has been included in the dask dataframe module since January 2025. ### Are these changes tested? Yes, with the extended dask build. ### Are there any user-facing changes? No. * GitHub Issue: #49083 Authored-by: AlenkaF <frim.alenka@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> * GH-49117: [Ruby] Add support for writing union arrays (#49118) ### Rationale for this change There are dense and sparse variants. ### What changes are included in this PR? * Add `garrow_union_array_get_n_fields()` * Add `ArrowFormat::UnionArray#children` * Add `ArrowFormat::DenseUnionArray#each_buffer` * Add `ArrowFormat::SparseUnionArray#each_buffer` * Add `ArrowFormat::UnionType#to_flatbuffers` * Add `Arrow::UnionArray#fields` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. 
* GitHub Issue: #49117 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49119: [Ruby] Add support for writing map array (#49120) ### Rationale for this change It's a list based array. ### What changes are included in this PR? * Add `ArrowFormat::MapType#to_flatbuffers` ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #49119 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-48922: [C++] Support Status-returning callables in Result::Map (#49127) ### Rationale for this change Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases. ### What changes are included in this PR? - Added EnsureResult specialization to allow Map to return Status directly. - Added unit tests to verify success/error propagation and return type resolution. ### Are these changes tested? Yes. ### Are there any user-facing changes? No * GitHub Issue: #48922 Authored-by: Abhishek Bansal <abhibansal593@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org> * GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095) ### Rationale for this change This PR restores the behavior previous to version 23 for floating-point parsing on overflow and subnormal. `fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases. ### What changes are included in this PR? Ignores `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion. ### Are these changes tested? Yes. 
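The cases such tests need to cover can be illustrated with Python's built-in `float`, which happens to treat out-of-range decimal literals the same way this PR describes (overflow becomes `±inf`, underflow becomes `0.0`). This is only a sketch of the intended semantics using Python's own parser as a stand-in; `parse_lenient` is a hypothetical name and this is not Arrow's `fast_float`-based code path.

```python
import math

# The three literals used in the CSV example of this PR description:
# underflow, positive overflow, negative overflow.
OUT_OF_RANGE_LITERALS = ["10E-617", "10E617", "-10E617"]


def parse_lenient(text: str) -> float:
    """Parse a decimal literal without treating out-of-range values as errors.

    Python's float() already saturates to +/-inf on overflow and returns 0.0
    on underflow, so it serves here as a stand-in for the restored behavior.
    """
    return float(text)


parsed = [parse_lenient(literal) for literal in OUT_OF_RANGE_LITERALS]
print(parsed)  # [0.0, inf, -inf]
```

None of the three literals raises an exception, which mirrors the PR's goal of producing a value instead of failing the conversion.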
Created tests for overflow with positively and negatively signed mantissas, and also created tests for subnormals, all of them for binary{16,32,64}. ### Are there any user-facing changes? It's a user-facing change. The CSV reader in `libarrow==23` was reading such values as strings, while before it parsed them as `0` or `±inf`. With this patch, the CSV reader in PyArrow outputs:
```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```
Closes #49003 * GitHub Issue: #49003 Authored-by: Alvaro-Kothe <kothe65@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org> * GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943) ### Rationale for this change The JSON test utility `GenerateAscii` was only generating ASCII characters. Test coverage should include proper UTF-8 and Unicode handling. ### What changes are included in this PR? Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629. Added that function as a util. ### Are these changes tested? There are existing tests for JSON. ### Are there any user-facing changes? No, test-only. * GitHub Issue: #48941 Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Antoine Pitrou <antoine@python.org> * GH-49067: [R] Disable GCS on macos (#49068) ### Rationale for this change Builds that complete on CRAN ### What changes are included in this PR? Disable GCS by default ### Are these changes tested? ### Are there any user-facing changes? Hopefully not **This PR includes breaking changes to public APIs.** (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.) 
**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.) * GitHub Issue: #49067 --------- Co-authored-by: Nic Crane <thisisnic@gmail.com> * GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116) ### Rationale for this change Current wheels are failing to be built due to old version of vcpkg failing with our latest main. ### What changes are included in this PR? - Update vcpkg version. - Update patches - Add `perl-Time-Piece` to some images as required to build newer OpenSSL. ### Are these changes tested? Yes on CI ### Are there any user-facing changes? No * GitHub Issue: #49115 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955) ### Rationale for this change Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this: ```python import pyarrow as pa import pyarrow.compute as pc pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32())) # [0, 1, 2, 3] pc.array_sort_indices(pa.DictionaryArray.from_arrays( indices=pa.array([None, None, None, None], type=pa.int8()), dictionary=pa.array([], type=pa.null()) )) # [0, 1, 2, 3] ``` I believe it does not make sense to specifically disallow this in dictionaries at this point. ### What changes are included in this PR? Added a unittest for null sorting behaviour. ### Are these changes tested? Yes, the unittest was added. ### Are there any user-facing changes? No. 
* GitHub Issue: #48954 Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Antoine Pitrou <antoine@python.org> * GH-36193: [R] arm64 binaries for R (#48574) ### Rationale for this change Issues building on ARM ### What changes are included in this PR? CI job and nixlibs update ### Are these changes tested? On CI ### Are there any user-facing changes? No AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) * GitHub Issue: #36193 Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com> * GH-48397: [R] Update docs on how to get our libarrow builds (#48995) ### Rationale for this change Turning off GCS on CRAN to prevent excessive build times, need to tell people who wanna work with GCS how to do that. ### What changes are included in this PR? Update docs. ### Are these changes tested? Will preview docs build. ### Are there any user-facing changes? Just docs. * GitHub Issue: #48397 Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com> * GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105) ### Rationale for This Change The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ. ### What Changes Are Included in This PR? This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. 
If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access. ### Are These Changes Tested? Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically. ### Are There Any User-Facing Changes? No. This change improves internal safety and robustness without altering public APIs or observable user behavior. * GitHub Issue: #49104 Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com> Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Rok Mihevc <rok@mihevc.org> * MINOR: [Docs] Add links to AI-generated code guidance (#49131) ### Rationale for this change Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though ### What changes are included in this PR? Add link to AI-generated code guidance ### Are these changes tested? No ### Are there any user-facing changes? No Lead-authored-by: Nic Crane <thisisnic@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com> * MINOR: [R] Add new vignette to pkgdown config (#49145) ### Rationale for this change CI failing on preview-docs; see #49141 ### What changes are included in this PR? Add the vignette created in #49068 to pkgdown config ### Are these changes tested? I'll trigger CI ### Are there any user-facing changes? Nah Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com> * GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088) Fixes: #49150 See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381 ### Rationale for this change Fix CI failures ### What changes are included in this PR? 
Tests are made more general to allow for Pandas 2 and Pandas 3 style string types ### Are these changes tested? By CI ### Are there any user-facing changes? No * GitHub Issue: #49150 Authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Rok Mihevc <rok@mihevc.org> * GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971) Let me preface this pull request that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix. ### Rationale for this change I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced. ### What changes are included in this PR? AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since this is a forward-declared Impl in the headers file but the destructor was defined inline (via `= default`), we're getting compilation issues with MSVC due to it requiring the complete type earlier than GCC/Clang. This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type. Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above. ### Are these changes tested? I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows. 
One quirk of this PR is that getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it. ### Are there any user-facing changes? Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns about things I'm missing, I'm happy to discuss those. * GitHub Issue: #41990 Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com> Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139) ### Rationale for this change We use nightly versions of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138 ### What changes are included in this PR? The nightly Cython install is removed and Cython is installed via the [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2). ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #49138 Authored-by: AlenkaF <frim.alenka@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com> * GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769) ### Rationale for this change Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics). The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. 
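The ceiling-division argument can be checked with a small sketch. `ceil_div` below is a Python stand-in for `bit_util::CeilDiv` and `slice_length` is a hypothetical helper; both are assumptions for illustration, and the sketch ignores the clipping against each list's actual length that a real kernel would also need.

```python
def ceil_div(value: int, divisor: int) -> int:
    """Ceiling division for a non-negative value and a positive divisor."""
    return (value + divisor - 1) // divisor


def slice_length(start: int, stop: int, step: int = 1) -> int:
    """Number of elements selected by [start:stop:step], before any clipping."""
    return ceil_div(stop - start, step)


# start == stop selects nothing, matching Python's own slicing semantics.
print(slice_length(0, 0))     # 0
print(len([1, 2, 3][0:0]))    # 0
# A stepped slice: indices 0 and 2 are selected.
print(slice_length(0, 3, 2))  # 2
```

Since the computed length is already 0 for `start == stop`, no special-casing of the empty slice is needed in the length computation itself.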
The only issue was the validation check (`start >= stop`) that prevented this from working. ### What changes are included in this PR? - Changed validation from `start >= stop` to `start > stop` - Updated error message - Added test cases ### Are these changes tested? Yes, tests were added. ### Are there any user-facing changes? Yes.
```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```
Before:
```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```
After:
```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459 Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: AlenkaF <frim.alenka@gmail.com> * GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135) Closes https://github.com/apache/arrow/issues/41863 ### Rationale for this change Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/ `LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec. However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:
```
ArrowException: Unsupported compression: lz4_raw
```
This is a friction issue, and confusing for some users who are aware of the differences. ### What changes are included in this PR? - Adding `LZ4_RAW` to the acceptable codec names list. - Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`. - Adding a test ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, an additive change to the accepted codec names. 
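The alias handling described above can be sketched in plain Python. Everything here is hypothetical: the names `_CODEC_ALIASES`, `_SUPPORTED_CODECS`, and `resolve_codec`, and the list of supported codecs, are illustrative assumptions and not PyArrow's actual internals.

```python
# Hypothetical alias table: both spellings resolve to the LZ4_RAW codec.
_CODEC_ALIASES = {
    "LZ4": "LZ4_RAW",
    "LZ4_RAW": "LZ4_RAW",
}

# Hypothetical set of canonical codec names, for illustration only.
_SUPPORTED_CODECS = {"NONE", "SNAPPY", "GZIP", "BROTLI", "ZSTD", "LZ4_RAW"}


def resolve_codec(name: str) -> str:
    """Map a user-supplied compression name to a canonical codec name."""
    upper = name.upper()
    canonical = _CODEC_ALIASES.get(upper, upper)
    if canonical not in _SUPPORTED_CODECS:
        raise ValueError(f"Unsupported compression: {name}")
    return canonical


print(resolve_codec("lz4"))      # LZ4_RAW
print(resolve_codec("lz4_raw"))  # LZ4_RAW
```

Adding the identity entry `LZ4_RAW -> LZ4_RAW` keeps the change additive: every name that resolved before still resolves to the same codec.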
* GitHub Issue: #41863 Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com> * GH-48868: [Doc] Document security model for the Arrow formats (#48870) ### Rationale for this change Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those. ### What changes are included in this PR? Add a Security Considerations page in the Format section. **Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html ### Are these changes tested? N/A ### Are there any user-facing changes? No. * GitHub Issue: #48868 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org> * GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005) ### Rationale for this change #49004 ### What changes are included in this PR? - Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI. Note: `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050 ### Are these changes tested? Yes, in CI ### Are there any user-facing changes? N/A * GitHub Issue: #49004 Lead-authored-by: Alina (Xi) Li <alina.li@improving.com> Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151) ### Rationale for this change #49092 ### What changes are included in this PR? - Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly. Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`. ### Are these changes tested? Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26 ### Are there any user-facing changes? 
Yes, the nightly ODBC file names will be changed as described above. * GitHub Issue: #49092 Authored-by: Alina (Xi) Li <alina.li@improving.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49156: [Python] Require GIL for string comparison (#49161) ### Rationale for this change With Cython 3.3.0.a0 this failed. After some discussion it seems that this should always have required the GIL. ### What changes are included in this PR? Move the statement out of the `with nogil` context manager. ### Are these changes tested? Existing CI builds pyarrow. ### Are there any user-facing changes? No * GitHub Issue: #49156 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> * GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577) ### Rationale for this change #48575 ### What changes are included in this PR? - Add new ODBC workflow for macOS Intel 15 and 14 arm64. - Added ODBC build fixes to enable build on macOS CI. ### Are these changes tested? Tested in CI and local macOS Intel and M1 environments. ### Are there any user-facing changes? N/A * GitHub Issue: #48575 Lead-authored-by: Alina (Xi) Li <alina.li@improving.com> Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com> Co-authored-by: Victor Tsang <victor.tsang@improving.com> Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com> Co-authored-by: vic-tsang <victor.tsang@improving.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165) ### Rationale for this change Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}`, which expands to an empty string, leading to invalid `if()` arguments. ### What changes are included in this PR? Use the variable name directly (no `${}`). ### Are these changes tested? 
Yes. ### Are there any user-facing changes? None. * GitHub Issue: #49164 Authored-by: Rossi Sun <zanmato1984@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-48132: [Ruby] Add support for writing dictionary array (#49175) ### Rationale for this change Delta dictionary message support is out of scope. ### What changes are included in this PR? * Add `ArrowFormat::DictionaryArray#each_buffer` * Add `ArrowFormat::DictionaryType#build_fb_type` * Add support for dictionary message in `ArrowFormat::StreamingWriter` * Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #48132 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * GH-49081: [C++][Parquet] Correct variant's extension name (#49082) ### Rationale for this change Correct variant extension according to arrow's specification. ### What changes are included in this PR? Modified variant's hardcoded extension name. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #49081 Authored-by: Zehua Zou <zehuazou2000@gmail.com> Signed-off-by: Gang Wu <ustcwg@gmail.com> * GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618) ### Rationale for this change This is the first in series of PRs adding type annotations to pyarrow and resolving #32609. ### What changes are included in this PR? 
This PR establishes infrastructure for type checking: - Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows - Configures type checkers to validate stub files (excluding source files for now) - Adds PEP 561 `py.typed` marker to enable type checking - Updates wheel build scripts to include stub files in distributions - Creates initial minimal stub directory structure - Updates developer documentation with type checking workflow ### Are these changes tested? No. This is mostly a CI change. ### Are there any user-facing changes? This does not add any actual annotations (only `py.typed` marker) so user should not be affected. * GitHub Issue: #32609 * GitHub Issue: #49102 Lead-authored-by: Rok Mihevc <rok@mihevc.org> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Rok Mihevc <rok@mihevc.org> * GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192) ### Rationale for this change See #49190 ### What changes are included in this PR? Fix `unknown job 'odbc' error` caused by typo ### Are these changes tested? Tested in CI ### Are there any user-facing changes? N/A * GitHub Issue: #49190 Authored-by: Alina (Xi) Li <alinal@bitquilltech.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191) Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0. 
<details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p> <blockquote> <h2>v3.7.0</h2> <ul> <li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li> <li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li> <li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect…

github-actions bot added the Component: C++ and awaiting review labels Jan 21, 2026

pitrou force-pushed the gh48924-fuzz-metadata-buffering branch 6 times, most recently from c559b54 to 7749642, on January 22, 2026 14:31

GH-48924: [C++][CI] Fuzz IPC file metadata pre-buffering

a4ae909

pitrou force-pushed the gh48924-fuzz-metadata-buffering branch from 7749642 to a4ae909 on January 22, 2026 15:08

pitrou changed the title ~~GH-48924: [C++][CI] Fuzz IPC file metadata pre-buffering~~ GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader Jan 22, 2026

pitrou marked this pull request as ready for review January 22, 2026 15:38

pitrou requested review from assignUser, jonkeane, kou and raulcd as code owners January 22, 2026 15:38

lidavidm approved these changes Jan 23, 2026

github-actions bot added the awaiting merge label and removed the awaiting review label Jan 23, 2026

pitrou merged commit 8010794 into apache:main Jan 26, 2026
53 of 54 checks passed

pitrou removed the awaiting merge label Jan 26, 2026

pitrou mentioned this pull request Jan 26, 2026

[C++][CI] Fuzz IPC file metadata pre-buffering #48924

Closed

pitrou deleted the gh48924-fuzz-metadata-buffering branch January 26, 2026 13:29

pitrou added the Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) and backport-candidate labels Jan 29, 2026

raulcd removed the backport-candidate label Feb 3, 2026

pitrou merged 1 commit into apache:main from pitrou:gh48924-fuzz-metadata-buffering
